{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "b4b6a552-b7f0-433d-9a70-61c4fcc52d5d",
   "metadata": {},
   "source": [
    "# 快速入门 GPT-4 Vison\n",
    "\n",
    "从历史上看，语言模型系统仅接受**文本**作为输入。但是单一的输入形式，限制了大模型的应用落地范围。\n",
    "\n",
    "随着技术发展，OpenAI 开发的 GPT-4 Turbo with Vision（简称 GPT-4V）允许模型接收**图像**作为输入，并回答关于它们的问题。\n",
    "\n",
    "📢注意，目前在 Assistants API 中使用 GPT-4 时还不支持图像输入。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3a701c56-0a2a-4dea-b458-234150b84ff2",
   "metadata": {},
   "source": [
    "## 使用 GPT-4V 识别线上图像（URL）\n",
    "\n",
    "![image_sample](https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "bf8689b2-94f2-4a35-a332-9ffed0a56aca",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Choice(finish_reason='length', index=0, logprobs=None, message=ChatCompletionMessage(content='这幅图呈现了一个宁静美丽的自然景观。图中的主要特点是一条木制的栈道，它穿越在郁郁葱葱的草地上，引向远处看不到的终点。草地上覆盖着茂密的绿色植被，显示出生机勃勃的春夏季节。背景中天空呈现出澄清的蓝色，点缀着几朵轻盈的白云，增添了一种宁静和平和的氛围。\\n\\n整个场景光线充足，阳光和蓝天与绿色植被的对比令人感觉清新愉快。这样的环境可能是理想的徒步旅行地点，提供了亲近自然和放松心情的完美机会。整体上，这幅图传达了大自然的宁静与和谐，是观赏和体验自然美景的一种', role='assistant', function_call=None, tool_calls=None))\n"
     ]
    }
   ],
   "source": [
    "from openai import OpenAI\n",
    "\n",
    "client = OpenAI()\n",
    "\n",
    "response = client.chat.completions.create(\n",
    "  model=\"gpt-4-turbo\",\n",
    "  messages=[\n",
    "    {\n",
    "      \"role\": \"user\",\n",
    "      \"content\": [\n",
    "        {\"type\": \"text\", \"text\": \"介绍下这幅图?\"},\n",
    "        {\n",
    "          \"type\": \"image_url\",\n",
    "          \"image_url\": {\n",
    "            \"url\": \"https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg\",\n",
    "          },\n",
    "        },\n",
    "      ],\n",
    "    }\n",
    "  ],\n",
    "  max_tokens=300,\n",
    ")\n",
    "\n",
    "print(response.choices[0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "9bcc9026-7485-428f-8269-ea9ae41405cb",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'这幅图呈现了一个宁静美丽的自然景观。图中的主要特点是一条木制的栈道，它穿越在郁郁葱葱的草地上，引向远处看不到的终点。草地上覆盖着茂密的绿色植被，显示出生机勃勃的春夏季节。背景中天空呈现出澄清的蓝色，点缀着几朵轻盈的白云，增添了一种宁静和平和的氛围。\\n\\n整个场景光线充足，阳光和蓝天与绿色植被的对比令人感觉清新愉快。这样的环境可能是理想的徒步旅行地点，提供了亲近自然和放松心情的完美机会。整体上，这幅图传达了大自然的宁静与和谐，是观赏和体验自然美景的一种'"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "response.choices[0].message.content"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7fb50a14-fa14-4c63-9f81-b98b0f65d9d9",
   "metadata": {},
   "source": [
    "### 封装成一个函数 query_image_description"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "c1ca5428-c7e1-4d7e-91f1-d4a05e95ac51",
   "metadata": {},
   "outputs": [],
   "source": [
    "def query_image_description(url, prompt=\"介绍下这幅图?\"):\n",
    "    client = OpenAI()  # 初始化 OpenAI 客户端\n",
    "    \n",
    "    # 发送请求给 OpenAI 的聊天模型\n",
    "    response = client.chat.completions.create(\n",
    "        model=\"gpt-4-turbo\",  # 指定使用的模型\n",
    "        messages=[\n",
    "            {\n",
    "                \"role\": \"user\",\n",
    "                \"content\": [\n",
    "                    {\"type\": \"text\", \"text\": prompt},\n",
    "                    {\"type\": \"image_url\", \"image_url\": {\"url\": url}},\n",
    "                ],\n",
    "            }\n",
    "        ],\n",
    "        max_tokens=300,\n",
    "    )\n",
    "    \n",
    "    # 返回模型的响应\n",
    "    return response.choices[0].message.content\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a0d0aceb-7cc5-4da1-b6db-e47716ba145a",
   "metadata": {},
   "source": [
    "### 调用函数测试\n",
    "\n",
    "![meme_0](https://p6.itc.cn/q_70/images03/20200602/0c267a0d3d814c9783659eb956969ba1.jpeg)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "454abb5c-49d3-42e6-867e-f44e25af5e0e",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "这幅图是一种幽默风格的搞笑比较。图中展示了两种不同状态的狗。左边是一只被幻想成拥有非常发达肌肉、类似人类健美运动员身体的柴犬，配文“16岁的我，工作后的我”，表达了年轻时充满活力和梦想的状态。右边的柴犬则显得普通、略显憔悴，配有文字“好累好困、好想摸鱼、看到鸡腿就开心，不属于我、我也有小肚腩、肉肉的很可爱”，描绘了工作后可能感到疲惫、减少锻炼而略显发福的现实状态。\n",
      "\n",
      "这种对比通常用来幽默地表达人们对于理想与现实之间差距的感慨，展示了许多人对于年轻时的憧\n"
     ]
    }
   ],
   "source": [
    "image_url = \"https://p6.itc.cn/q_70/images03/20200602/0c267a0d3d814c9783659eb956969ba1.jpeg\"\n",
    "content = query_image_description(image_url)\n",
    "print(content)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2471306a-84e2-4793-b065-0741fbe57262",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "af79850f-83b5-49c4-a3f3-f2c01a28f458",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "63ae05bd-872c-4638-8259-df4f420aaa1d",
   "metadata": {},
   "source": [
    "### 使用 GPT-4V 识别本地图像文件（Base64编码）\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "1e83da68-d387-46da-8236-78fc607d1fab",
   "metadata": {},
   "outputs": [],
   "source": [
    "from openai import OpenAI\n",
    "import base64\n",
    "import requests\n",
    "import json\n",
    "\n",
    "client = OpenAI()  # 初始化 OpenAI 客户端\n",
    "\n",
    "def query_base64_image_description(image_path, prompt=\"解释下图里的内容？\", max_tokens=1000):\n",
    "\n",
    "    # 实现 Base64 编码\n",
    "    def encode_image(path):\n",
    "        with open(path, \"rb\") as image_file:\n",
    "            return base64.b64encode(image_file.read()).decode('utf-8')\n",
    "\n",
    "    # 获取图像的 Base64 编码字符串\n",
    "    base64_image = encode_image(image_path)\n",
    "\n",
    "    # 构造请求的 HTTP Header\n",
    "    headers = {\n",
    "        \"Content-Type\": \"application/json\",\n",
    "        \"Authorization\": f\"Bearer {client.api_key}\"\n",
    "    }\n",
    "\n",
    "    # 构造请求的负载\n",
    "    payload = {\n",
    "        \"model\": \"gpt-4-turbo\",\n",
    "        \"messages\": [\n",
    "            {\n",
    "                \"role\": \"user\",\n",
    "                \"content\": [\n",
    "                    {\"type\": \"text\", \"text\": prompt},\n",
    "                    {\"type\": \"image_url\", \"image_url\": {\"url\": f\"data:image/jpeg;base64,{base64_image}\"}}\n",
    "                ]\n",
    "            }\n",
    "        ],\n",
    "        \"max_tokens\": max_tokens\n",
    "    }\n",
    "\n",
    "    # 发送 HTTP 请求\n",
    "    response = requests.post(\"https://api.openai.com/v1/chat/completions\", headers=headers, json=payload)\n",
    "\n",
    "    # 检查响应并提取所需的 content 字段\n",
    "    if response.status_code == 200:\n",
    "        response_data = response.json()\n",
    "        content = response_data['choices'][0]['message']['content']\n",
    "        return content\n",
    "    else:\n",
    "        return f\"Error: {response.status_code}, {response.text}\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "89dd0f99-8086-473f-80a4-497e6dd07c17",
   "metadata": {},
   "source": [
    "#### 使用 Assistants API生成的 GDP 40年对比曲线图\n",
    "\n",
    "![gdp_data](./images/gdp_1980_2020.jpg)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "3c0e9063-e8d9-4bc1-ae60-ad0aa5bee32b",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "这幅图展示了1980年到2020年间，美国、中国、日本和德国的国内生产总值（GDP）比较。每个国家的GDP以万亿美元为单位标出。从图中可以看出：\n",
      "\n",
      "- **美国（蓝线）**：GDP持续增长，从1980年的约3万亿美金增长到2020年的超过20万亿美金。\n",
      "- **中国（红线）**：自1980年起，GDP增长非常显著，从不到1万亿美金增长至接近15万亿美金。\n",
      "- **日本（紫线）**：1980年至1995年GDP增长明显，之后增长放缓，并在2005年至2015年间保持相对平稳。\n",
      "- **德国（绿线）**：GDP增长较为平稳，从1980年的不到1万亿美金增长至2020年的约4万亿美金。\n",
      "\n",
      "总体而言，图表清晰地表示了这四个经济体在过去四十年间的经济增长情况，其中中国的增长尤其引人注目，显示了其经济的迅速崛起。美国则保持了其全球经济领导地位的增长趋势。日本在90年代后增长放缓，而德国则显示出稳定的增长模式。\n"
     ]
    }
   ],
   "source": [
    "content = query_base64_image_description(\"./images/gdp_1980_2020.jpg\")\n",
    "print(content)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6d18b227-32a6-4450-86bd-c99ad5c533b9",
   "metadata": {},
   "source": [
    "#### 使用 GPT-4V 识别手写体笔记\n",
    "\n",
    "![](./images/handwriting_0.jpg)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "4193fa11-5edd-404c-9472-0cb8cc6799fc",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "这张图片显示的是一本笔记本上手写的文字，内容主要涉及到自然语言处理（NLP）中的训练技术，如Prompt Tuning和LoRA技术。\n",
      "\n",
      "1. **Prompt Tuning（简要模型调整）**： 提到了使用Prompt Tuning来调整一个小型的Transformer模型。此处解释了输入X（由个别输入X1, X2, ..., Xn组成）。每个输入首先通过一个Embedding过程转换，然后通过Token变换。输出Y是通过矩阵W与转换后的输入X'之间的乘法得出。\n",
      "\n",
      "2. **Prefix Tuning：** 这部分说明了Prefix Tuning的过程，其中添加了前缀权重W_p到原始权重W_j中，得到新的权重W'用于生成输出Y。\n",
      "\n",
      "3. **LoRA调整技术**： 这部分涉及Linear Re-parameterization（线性重新参数化），通过调整矩阵ΔW（通过两个矩阵A和B的乘积表示）来修改权重W。这是一种节省参数调整的方法，使原有模型的W变为W+ΔW，这里也涉及到了一些矩阵运算和优化策略。\n",
      "\n",
      "其中还提到了两个案例分析的存储需求：“LLAMA”需要65GB，而经过LoRA调整的“QLLoRA”仅需要48GB。\n",
      "\n",
      "这些笔记对于理解NLP中一些先进的模型调整技术十分有用，尤其对于需要在资源受限的环境下部署NLP模型的研究人员或实践者。\n"
     ]
    }
   ],
   "source": [
    "content = query_base64_image_description(\"./images/handwriting_0.jpg\")\n",
    "print(content)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ca046601-018c-455c-ace2-41392cbda456",
   "metadata": {},
   "source": [
    "#### 在 Jupyter 标准输出中渲染 Markdown 格式内容"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "516ee35b-1337-4b22-aea2-ee0adb706098",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "这张图片显示的是一本笔记本上手写的文字，内容主要涉及到自然语言处理（NLP）中的训练技术，如Prompt Tuning和LoRA技术。\n",
       "\n",
       "1. **Prompt Tuning（简要模型调整）**： 提到了使用Prompt Tuning来调整一个小型的Transformer模型。此处解释了输入X（由个别输入X1, X2, ..., Xn组成）。每个输入首先通过一个Embedding过程转换，然后通过Token变换。输出Y是通过矩阵W与转换后的输入X'之间的乘法得出。\n",
       "\n",
       "2. **Prefix Tuning：** 这部分说明了Prefix Tuning的过程，其中添加了前缀权重W_p到原始权重W_j中，得到新的权重W'用于生成输出Y。\n",
       "\n",
       "3. **LoRA调整技术**： 这部分涉及Linear Re-parameterization（线性重新参数化），通过调整矩阵ΔW（通过两个矩阵A和B的乘积表示）来修改权重W。这是一种节省参数调整的方法，使原有模型的W变为W+ΔW，这里也涉及到了一些矩阵运算和优化策略。\n",
       "\n",
       "其中还提到了两个案例分析的存储需求：“LLAMA”需要65GB，而经过LoRA调整的“QLLoRA”仅需要48GB。\n",
       "\n",
       "这些笔记对于理解NLP中一些先进的模型调整技术十分有用，尤其对于需要在资源受限的环境下部署NLP模型的研究人员或实践者。"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from IPython.display import display, Markdown\n",
    "\n",
    "# 使用 display 和 Markdown 函数显示 Markdown 内容\n",
    "display(Markdown(content))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b72ebbe3-87cc-4867-9cf0-62e5ed684482",
   "metadata": {},
   "source": [
    "![](./images/handwriting_1.jpg)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "7c046958-aa7a-4066-88fa-4134869d9226",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "这张图片展示的是一本笔记本的两页，内容涉及深度学习、特别是关于自然语言处理（NLP）的各种技术和方法。主要讨论了Transformer模型及其改进方法和训练技术。\n",
       "\n",
       "左侧页面的上部标注有“自然语言处理”、“基础”和“评价”，可能是对内容的分类。提到了Transformer模型，并列举了不同的测试标准和指标，如PeFT (“Prompt-based Fine-Tuning”) 和模型性能对比（“Benchmark”）。此外，还提到了不同的方法，如Prompt Tuning和Adapter。具体包括：  \n",
       "- Adapter: 一个2019年Google的研究\n",
       "- Prefix: 代表2021年Stanford的工作\n",
       "- Prompt: 同样是2021年Google的研究\n",
       "- P-Tuning V1和V2：2021年的两种方法\n",
       "- Soft prompts：2021年的研究，提示模板基于模板\n",
       "\n",
       "右侧页面讨论了多模态指令式微调（multi-modality instruction FT）、Llama (3B)、LoRA、PETC（2022年的新技术）等。还有部分文字描述了如何使用prefix-tuning和Adapter方法来细化在大型语言模型（LLMs）中的处理。\n",
       "\n",
       "页面提到了几种语言模型，如：\n",
       "- Llama \n",
       "- BLOOM\n",
       "- ChatGLM \n",
       "- Alpaca\n",
       "\n",
       "这些内容表明这本笔记本的主人正在研究或学习NLP领域的最新技术和方法，特别是如何通过各种微调技术提升已有的大型语言模型的性能。"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "content = query_base64_image_description(\"./images/handwriting_1.jpg\")\n",
    "display(Markdown(content))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "156a0f17-cca8-4f01-9ce5-53384b5ffda4",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e3bd772f-9492-4f6c-b05a-666b772ca3c9",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8afdeacb-aac1-4692-be2b-fb7957ba5e8f",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "79a8d459-d98e-4215-9fbf-38ad37080475",
   "metadata": {},
   "source": [
    "## Homework: \n",
    "\n",
    "\n",
    "### #1\n",
    "\n",
    "使用 GPT-4V 识别带有手写体文字的本地图像文件，分享结果。\n",
    "\n",
    "### #2\n",
    "\n",
    "整合 `query_base64_image_description` 函数和 Markdown 格式渲染方法，使得输出结果更易阅读。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0909bf27-9c4a-498c-9fae-0f442062b9a8",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.14"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
